Speech Recognition Using Librosa
Introduction to Librosa
Speech recognition has become a pivotal technology in various applications, from virtual assistants to transcription services. One popular Python library that aids in processing and analyzing audio signals for speech recognition is Librosa. In this blog, we will explore how to use Librosa for basic speech recognition tasks and understand its role in audio analysis.
Importing Libraries
| Library/Module | Description |
|---|---|
| torch | A deep learning framework for building and training neural networks. |
| transformers.Wav2Vec2ForCTC | A pre-trained model for automatic speech recognition using Connectionist Temporal Classification. |
| transformers.Wav2Vec2Tokenizer | A tokenizer to preprocess audio inputs for the Wav2Vec2 model. |
| transformers.pipeline | A simple interface for running tasks like speech-to-text, translation, and more using pre-trained models. |
| librosa | A Python library for audio analysis and feature extraction. |
| vaderSentiment.SentimentIntensityAnalyzer | A tool for sentiment analysis, particularly effective for short texts like sentences. |
| IPython.display.Audio | Renders an inline audio player for listening to recordings in the notebook. |
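For reference, a minimal import cell matching the table above might look like this (the original import code isn't shown, so treat it as a sketch):

```python
import torch                         # Deep learning framework
import librosa                       # Audio loading and feature extraction
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer, pipeline
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from IPython.display import Audio    # Inline audio playback in notebooks
```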
Reading Audio File
The voice data features diverse sentences focusing on sensory experiences and cultural food references, making it ideal for testing speech recognition and contextual comprehension.
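The clip can be played inline in the notebook with IPython's Audio widget (note that the player here references harvard.mp3, while the transcription below loads harvard.wav):

```python
from IPython.display import Audio

# Render an inline player for the recording
Audio('harvard.mp3')
```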
Implementation Steps
1. Load the Wav2Vec 2.0 Model and Tokenizer
- The `Wav2Vec2Tokenizer` and `Wav2Vec2ForCTC` classes are loaded from Facebook's pre-trained Wav2Vec 2.0 deep learning model to transcribe spoken words into written text.
- The `Wav2Vec2Tokenizer` is responsible for converting raw audio input into a format that the Wav2Vec 2.0 model can understand.
- The `Wav2Vec2ForCTC` class represents the Wav2Vec 2.0 model itself, topped with a head for CTC (Connectionist Temporal Classification).
```python
# Load the Wav2Vec 2.0 model and tokenizer
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")
```
2. Load and Preprocess the Audio
- The audio file harvard.wav is loaded using `librosa.load`:
  - `y`: contains the audio data (samples).
  - `sr`: the sample rate (16,000 Hz in this case), which is the standard rate for the Wav2Vec 2.0 model. A higher sample rate captures more detail, giving a clearer, higher-resolution signal.
- The raw audio data (`y`) is tokenized using the tokenizer, which converts the audio into a format the model can understand and prepares it for input. The `return_tensors="pt"` option specifies that the output should be in PyTorch tensor format.
```python
# Load and preprocess the audio
audio_file = "harvard.wav"
y, sr = librosa.load(audio_file, sr=16000)  # Wav2Vec2 works best with 16kHz audio
input_values = tokenizer(y, return_tensors="pt", padding="longest").input_values
```
3. Perform Speech Recognition
- `torch.no_grad()`: this context manager disables gradient calculation, saving memory and computation since we are only running inference (i.e., generating predictions, not training).
- The model outputs logits, which are its raw prediction scores.
- `torch.argmax` finds the most likely token at each time step, and the predicted token IDs are then decoded back into human-readable text using the tokenizer.
```python
# Perform inference
with torch.no_grad():
    logits = model(input_values).logits

# Decode the output
predicted_ids = torch.argmax(logits, dim=-1)
transcription = tokenizer.decode(predicted_ids[0])

print("Transcribed text:", transcription)
```
Transcribed text: THE STALE SMELL OF OLD BEER LINGERS IT TAKES HEAT TO BRING OUT THE ODOUR A COLD DIP RESTORES HEALTH AND ZEST A SALT PICKLE TASTES FINE WITH HAM TAKO'S AL PASTOR ARE MY FAVORITE A ZESTFUL FOOD IS THE HOT CROSS BUN
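As a side note, the `transformers.pipeline` helper imported earlier can collapse the tokenize/infer/decode steps above into a few lines. A minimal sketch, assuming the standard "automatic-speech-recognition" task with the same checkpoint (reading the file directly requires ffmpeg):

```python
from transformers import pipeline

# Wrap the same Wav2Vec 2.0 checkpoint in an ASR pipeline
asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-large-960h")

# The pipeline handles resampling, tokenization, and CTC decoding internally
result = asr("harvard.wav")
print(result["text"])
```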
4. Sentiment Analysis with VADER
Once the text is transcribed, we use VADER (Valence Aware Dictionary and sEntiment Reasoner) to perform sentiment analysis.
VADER returns scores for positive, neutral, and negative sentiment, along with a compound score, ranging from -1 (most negative) to +1 (most positive), that represents the overall sentiment.
```python
# Initialize VADER sentiment analyzer
vader_analyzer = SentimentIntensityAnalyzer()

# Perform VADER sentiment analysis
vader_scores = vader_analyzer.polarity_scores(transcription)

# Output the VADER sentiment analysis results
print("VADER Sentiment Analysis Result:")
print("Positive:", vader_scores['pos'])
print("Neutral:", vader_scores['neu'])
print("Negative:", vader_scores['neg'])
print("Compound:", vader_scores['compound'])
```
VADER Sentiment Analysis Result:
Positive: 0.149
Neutral: 0.851
Negative: 0.0
Compound: 0.7184
Conclusion:
Integrating Wav2Vec 2.0 for speech-to-text and VADER for sentiment analysis bridges the gap between audio data and insights. Speech recognition transcribes spoken language, while sentiment analysis uncovers emotional tones. These techniques enhance NLP and audio analysis, enabling smarter, real-world applications.